Search CORE

18 research outputs found

Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy

Author: Ehsan Totoni
Josep Torrellas
Laxmikant V. Kale
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 11/12/2014
Field of study

The cache hierarchy often consumes a large portion of a processor’s energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application’s future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application’s pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67 % of the cache energy can be saved with only a 2.4 % performance penalty on average. Moreover, we demonstrate that, for some applica-tions, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30 % and save 75 % of the cache energy. I

CiteSeerX

Crossref

Parallelizing Julia with a Non-Invasive DSL

Author: Anderson Todd A.
Kuper Lindsey
Liu Hai
Shpeisman Tatiana
Totoni Ehsan
Vitek Jan
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st European Conference on Object-Oriented Programming (ECOOP 2017)
Publication date: 01/01/2017
Field of study

Computational scientists often prototype software using productivity languages that offer high-level programming abstractions. When higher performance is needed, they are obliged to rewrite their code in a lower-level efficiency language. Different solutions have been proposed to address this trade-off between productivity and efficiency. One promising approach is to create embedded domain-specific languages that sacrifice generality for productivity and performance, but practical experience with DSLs points to some road blocks preventing widespread adoption. This paper proposes a non-invasive domain-specific language that makes as few visible changes to the host programming model as possible. We present ParallelAccelerator, a library and compiler for high-level, high-performance scientific computing in Julia. ParallelAccelerator\u27s programming model is aligned with existing Julia programming idioms. Our compiler exposes the implicit parallelism in high-level array-style programs and compiles them to fast, parallel native code. Programs can also run in "library-only" mode, letting users benefit from the full Julia environment and libraries. Our results show encouraging performance improvements with very few changes to source code required. In particular, few to no additional type annotations are necessary

Dagstuhl Research Online Publication Server

Parallelizing Julia with a Non-Invasive DSL (Artifact)

Author: Anderson Todd A.
Kuper Lindsey
Liu Hai
Shpeisman Tatiana
Totoni Ehsan
Vitek Jan
Publication venue: DARTS - Dagstuhl Artifacts Series. DARTS, Volume 3, Issue 2
Publication date: 01/01/2017
Field of study

This artifact is based on ParallelAccelerator, an embedded domain-specific language (DSL) and compiler for speeding up compute-intensive Julia programs. In particular, Julia code that makes heavy use of aggregate array operations is a good candidate for speeding up with ParallelAccelerator. ParallelAccelerator is a non-invasive DSL that makes as few changes to the host programming model as possible

Dagstuhl Research Online Publication Server

Power, Reliability, Performance: One System to Rule Them All

Author: Acun Bilge
Kalé Laxmikant
Langer Akhil
Meneses-Rojas Esteban
Menon Harshitha
Sarood Osman
Totoni Ehsan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

En un diseño basado en el marco de programación paralelo Charm ++, un sistema de tiempo de ejecución adaptativo interactúa dinámicamente con el administrador de recursos de un centro de datos para controlar la energía mediante la programación inteligente de trabajos, la reasignación de recursos y la reconfiguración de hardware. Gestiona simultáneamente la fiabilidad al enfriar el sistema al nivel óptimo de la aplicación en ejecución y mantiene el rendimiento a través del equilibrio de carg

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional del Instituto Tecnologico de Costa Rica

Energy-efficient computing for HPC workloads on heterogeneous manycore chips

Author: Akhil Langer
Ehsan Totoni
Laxmikant V. Kalé
Udatta S. Palekar
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015
Field of study

Power and energy efficiency is one of the major challenges to achieve exascale computing in the next several years. While chips operating at low voltages have been studied to be highly energy-efficient, low voltage operations lead to heterogeneity across cores within the microprocessor chip. In this work, we study chips with low voltage operation and discuss programming systems, and performance modeling in the presence of heterogeneity. We propose an integer linear programming based approach for selecting optimal configu-ration of a chip that minimizes its energy consumption. We obtain an average of 26 % and 10.7 % savings in energy con-sumption of the chip for two HPC mini-applications- min-iMD and Jacobi, respectively. We also evaluate the energy savings with execution time constraints, using the proposed approach. These energy savings are significantly more than the savings by sub-optimal configurations obtained from heuristics

CiteSeerX

Crossref

Parallel Programming with Migratable Objects: Charm++ in Practice

Author: Acun Bilge
Gupta Abhishek
Jain Nikhil
Kale Laxmikant
Langer Akhil
Menon Harshitha
Mikida Eric
Ni Xiang
Robson Michael
Sun Yanhua
Totoni Ehsan
Wesolowski Lukasz
Publication venue: Smith ScholarWorks
Publication date: 16/01/2014
Field of study

The advent of petascale computing has introduced new challenges (e.g. Heterogeneity, system failure) for programming scalable parallel applications. Increased complexity and dynamism in science and engineering applications of today have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to development of applications that scale irrespective of the rough landscape of supercomputing technology. Empirical evaluation presented in this paper spans many miniapplications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede

Smith College: Smith ScholarWorks

Structure-adaptive parallel solution of sparse triangular linear systems

Author: Aliaga
Baker
Davis
Davis
Ehsan Totoni
Falgout
Heath
Higham
Kale
Laxmikant V. Kale
Li
Mayer
Michael T. Heath
Raghavan
Saad
Saltz
Wolf
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Simulation-based performance analysis and tuning for future supercomputers

Author: Totoni Ehsan
Publication venue
Publication date: 01/12/2011
Field of study

Hardware and software co-design is becoming increasingly important due to complexities in supercomputing architectures. Simulating applications before there is access to the real hardware can assist machine architects in making better design decisions that can optimize application performance. At the same time, the application and run-time can be optimized and tuned beforehand. BigSim is a simulation-based performance prediction framework designed for these purposes. It can be used to perform packet-level network simulations of parallel applications using existing parallel machines. In this thesis, we demonstrate the utility of BigSim in analyzing and optimizing parallel application performance for future systems based on the PERCS network. We present simulation studies using benchmarks and real applications expected to run on future supercomputers. Future peta-scale systems will have more than 100,000 cores, and we present simulations at that scale

Illinois Digital Environment for Access to Learning and Scholarship Repository

Power and energy management of modern architectures in adaptive HPC runtime systems

Author: Totoni Ehsan
Publication venue
Publication date: 01/12/2014
Field of study

Power and energy efficiency are important challenges for the High Performance Computing (HPC) community. Excessive power consumption is a main limitation for further scaling of HPC systems, and researchers believe that current technology trends will not provide Exascale performance within a reasonable power budget in near future. Hardware innovations such as the proposed Exascale architectures and Near Threshold Computing are expected to improve power efficiency significantly, but more innovations are required in this domain to make Exascale possible. To help shrink the power efficiency gap, we argue that adaptive runtime systems can be exploited. The runtime system (RTS) can save significant power, since it is aware of both the hardware properties and the application behavior. We use application-centric analysis of different architectures to design automatic adaptive RTS techniques that save significant power in different system components, only with minor hardware support. In a nutshell, we analyze different modern architectures and common applications and illustrate that some system components such as caches and network links consume extensive power disproportionately for common HPC applications. We demonstrate how a large fraction of power consumed in caches and networks can be saved using our approach automatically. In these cases, the hardware support the RTS needs is the ability to turn off ways of set-associative caches and network links. We also present some required RTS techniques, such as recognizing the running application’s pattern using pattern recognition to predict its future and adapt the hardware appropriately. Furthermore, we address two types of prevalent heterogeneity: utilization of accelerator devices and process variation. To study accelerators, we analyze and optimize an example application on a heterogeneous architecture and demonstrate techniques for efficient mapping on different devices (CPU and GPU). To address process variation challenges, we develop accurate models that let the RTS schedule efficiently in the presence speed and power consumption variation. Using the models, we develop a novel scheduling framework that uses integer linear programming to enforce different performance and power consumption constraints

CiteSeerX

Illinois Digital Environment for Access to Learning and Scholarship Repository